Preparation Script for Training on Mozilla Commonvoice #111

Open · wants to merge 15 commits into base: main

Conversation

@RuntimeRacer (Contributor) commented May 1, 2023

This PR provides an end-to-end preparation script for Mozilla CommonVoice.

I built it by copying over the scripts from AIShell and combining them with the CommonVoice preparation scripts found in Icefall, which also uses Lhotse. References:

Some additional info and stats:

  • The data for the 24 languages included in the script (even more are available in the full CV corpus) is 432 GB; downloading and extracting the archives took about 12 h on my 200 Mbps connection, using a RAID 0 drive consisting of 2x PCIe x4 M.2 SSDs.
  • Preparation + tokenization took about another day.
  • I had to cut down the train/dev datasets of all the downloaded languages to 400 samples each from their dev and train subsets, because otherwise they would have become too big and stalled in a loop at the validation step. Even with the resulting 9,600 cuts per dev/train set, it takes about 30 seconds to calculate the validation loss. In case you want to train on a smaller subset of languages, you may want to increase that number or use the complete train/dev set of those language(s).
  • I was able to run training fine with up to 5 GPUs; however, there still seems to be a bug in the validation calculation (training problem #86), which requires me to use only 1 card as of now, and I hit an OOM error (CUDA OOM error when "saving batch" #110) after ~164k steps, probably due to max-duration 80 being too high for this dataset (running on an RTX 3090 24 GB).

Since I have not finished training yet, I cannot provide any sample models, results, or stats at this point.

@eschmidbauer

Any way you can limit this to English only? I tried this branch and it filled up my disk.

@RuntimeRacer (Contributor, Author)

> Any way you can limit this to English only? I tried this branch and it filled up my disk.

For English only, just limit the language list here to contain only "en": https://github.com/lifeiteng/vall-e/pull/111/files#diff-9c086567a8bee92cd4ae661ae5d75be66ae5340f8982cb287a09a78aee2041bdR22
You might want to comment out these lines as well: https://github.com/lifeiteng/vall-e/pull/111/files#diff-9c086567a8bee92cd4ae661ae5d75be66ae5340f8982cb287a09a78aee2041bdR116-R125
And remove the _subset from the test / dev datasets too, so you have a bigger test and validation dataset: https://github.com/lifeiteng/vall-e/pull/111/files#diff-9c086567a8bee92cd4ae661ae5d75be66ae5340f8982cb287a09a78aee2041bdR134-R135
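
Put together, the English-only changes would look roughly like the sketch below. This is only an illustration: the "languages" array name is an assumption (check what the prepare script in this PR actually uses), while the cutsDevList/cutsTestList variables, audio_feats_dir, and the *_subset manifest names come from the script itself.

# Sketch of the English-only setup; relies on audio_feats_dir being set
# earlier in the prepare script, as it is in this PR.
languages=("en")   # instead of the full 24-language list

# Later in the script, drop the "_subset" suffix so the full dev/test
# manifests are used instead of the 400-sample subsets:
for lang in "${languages[@]}"; do
  cutsDevList+="${audio_feats_dir}/commonvoice_cuts_${lang}_dev.jsonl.gz "
  cutsTestList+="${audio_feats_dir}/commonvoice_cuts_${lang}_test.jsonl.gz "
done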

@eschmidbauer

Thanks! I will give it a try

@@ -0,0 +1 @@
甚至 出现 交易 几乎 停滞 的 情况
@lifeiteng (Owner):
update prompts

@lifeiteng (Owner)

In order to get reasonable results, we need to design the multi-language symbol set and work with a language ID.
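
One possible way to realize the language-ID idea, purely as an illustration (none of this is in the PR; the per-language token files and their one-symbol-per-line format are assumptions): prefix every text token with its language code so symbols from different languages cannot collide in the shared symbol table.

# Illustrative sketch only: merge hypothetical per-language token lists
# (tokens_en.txt, tokens_de.txt, ..., one symbol per line) into a single
# language-tagged token list.
for lang in en de fr; do
  sed "s/^/${lang}_/" "tokens_${lang}.txt"
done | sort -u > multilang_tokens.txt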

cutsDevList+="${audio_feats_dir}/commonvoice_cuts_${lang}_dev_subset.jsonl.gz "
cutsTestList+="${audio_feats_dir}/commonvoice_cuts_${lang}_test_subset.jsonl.gz "
done
# echo "${cutsTrainList}" # debug
@lifeiteng (Owner):
clean comments

@lifeiteng (Owner)

@RuntimeRacer please update prompts.

Can you share the results here?

@RuntimeRacer (Contributor, Author)

> @RuntimeRacer please update prompts.
>
> Can you share the results here?

@lifeiteng I just started training the NAR model today; I will share results in a bit once the first few epochs have been completed.

Will also update comments then.

@pawel-polyai

@RuntimeRacer - any updates on the training performance?

@RuntimeRacer (Contributor, Author)

@pawel-polyai It's currently training NAR epoch 4 on 6x RTX 3090, after training 10 epochs for AR. Intermediate results are mediocre so far; it is able to synthesize speech (tested only English and German), but it is not yet able to fully maintain speaker identity or accent.

For example, after NAR epoch 1 it spoke with a seemingly Slavic accent; after epochs 2 and 3 it changed to French for some reason. So I'm not sure yet how precise it can get, nor whether the accent it speaks with comes from an attempted transfer of speaker identity, randomly from the last training data it saw, or from dataset bias.

However, the loss is still decreasing for NAR and I'll keep you updated. Sharing training graphs here as well:
[image: training graphs]

Also sharing my (very not-in-depth) examples; only tested with one speaker that I found TTS models have had a hard time replicating in the past, and also the intermediate models of NAR epochs 1-3: https://drive.google.com/drive/folders/1-bCwvXdXd4O2NOBigoXVdArAnZoigvWc?usp=sharing

If you want to play around with it yourself, you can perform inference with these commands:
python bin/infer.py --output-dir ./exp/valle/results --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "You're in the presence of Suzette!" --audio-prompts ./prompts/Suzette_crop1.wav --text "I am the magnificent dark princess of the netherworld." --checkpoint exp/valle/epoch-3.pt

python bin/infer.py --output-dir ./exp/valle/results --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "You're in the presence of Suzette!" --audio-prompts ./prompts/Suzette_crop1.wav --text "Ich bin die großartige Prinzessin der Ätherwelt." --checkpoint exp/valle/epoch-3.pt

@chenjiasheng (Collaborator)

Thank you for your detailed and valuable share.
The sharp swings of your train and valid loss/acc curves seem abnormal.
Generally, after 400k steps of training, even for large datasets, the NAR ACC will be more than 68%.
So there may be some issues in your data pipeline?
@lifeiteng What do you think?

@RuntimeRacer (Contributor, Author) commented May 19, 2023

> Thank you for your detailed and valuable share. The sharp swings of your train and valid loss/acc curves seem abnormal. Generally, after 400k steps of training, even for large datasets, the NAR ACC will be more than 68%. So there may be some issues in your data pipeline? @lifeiteng What do you think?

I'm training this model on 20 different languages, so I believe it has issues handling some of them, or the dialects of a certain subset of the data. I believe it is still improving in accuracy, though.
If it still turns out to be bad / inaccurate after 10 epochs, I will stop the multi-language experiment and continue with LibriLight English only instead.

@lifeiteng (Owner)

VALL-E (this repo) focuses on a single language; in order to support multi-language, we should design some experiments and verify them.

If this PR can get reasonable results, I'm OK to approve it.
Anyway, thanks to @RuntimeRacer for the contribution.

@chenjiasheng we can synthesize some audio to judge the effectiveness of the model & data pipeline.
66% seems fine, the model hasn't converged yet.

@RuntimeRacer (Contributor, Author)

> VALL-E (this repo) focuses on a single language; in order to support multi-language, we should design some experiments and verify them.
>
> If this PR can get reasonable results, I'm OK to approve it. Anyway, thanks to @RuntimeRacer for the contribution.
>
> @chenjiasheng we can synthesize some audio to judge the effectiveness of the model & data pipeline. 66% seems fine, the model hasn't converged yet.

Yes, it is still in the process of converging; I believe even the 10-epoch AR model was far from fully converged, so it might be worth the effort to do another follow-up training of the stage 1 / AR model.

@lifeiteng Do you think I can just continue AR training independently later, even though the NAR model has already been trained, and after ~10 more AR epochs (which would be 20 in total) do another 10 epochs on NAR for fine-tuning?

So the current iteration would be 10 AR / 10 NAR,

and later 20 AR / 20 NAR, for example.

@lifeiteng (Owner)

@RuntimeRacer NAR needs more epochs than AR. You can switch to training AR.
Now you can try synthesizing audio with infer.py --continual, which can verify whether the NAR model works.
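
For reference, such a check could look roughly like the command below. This is only a sketch that reuses the flags from the inference commands shared earlier and adds the --continual option; double-check the exact flag spelling and value format against infer.py.

# Sketch only: same flags as the earlier inference commands, plus --continual
# (value format assumed to match the other boolean flags) to let infer.py
# continue from the audio prompt and sanity-check the NAR stage.
python bin/infer.py --output-dir ./exp/valle/results --model-name valle \
  --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 \
  --num-decoder-layers 12 --continual true \
  --text-prompts "You're in the presence of Suzette!" \
  --audio-prompts ./prompts/Suzette_crop1.wav \
  --text "I am the magnificent dark princess of the netherworld." \
  --checkpoint exp/valle/epoch-3.pt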

@RuntimeRacer (Contributor, Author) commented May 20, 2023

@lifeiteng I am confused now. I tried to restart AR training from the checkpoint I had already trained NAR on, using this command:
python3 bin/trainer.py --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 --num-buckets 6 --dtype bfloat16 --save-every-n 2500 --valid-interval 2500 --model-name valle --share-embedding true --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 --base-lr 0.05 --warmup-steps 200 --average-period 0 --num-epochs 40 --start-epoch 11 --start-batch 0 --accumulate-grad-steps 4 --world-size 6 --keep-last-k 40 --exp-dir exp/valle --manifest-dir /workspace/kajispeech-v2/commonvoice/data/tokenized --text-tokens /workspace/kajispeech-v2/commonvoice/data/tokenized/unique_text_tokens.k2symbols --oom-check false

It also said it loaded from the existing epoch-10.pt file, which contains 5 epochs of NAR.

But now, after a few hours, I checked the graphs and outputs, and it has basically started completely from scratch:
[image: training graph]

All the checkpoints contain only AR weights, judging by their size, and training started off from step 0.
However, the graph suggests to me that it did not in fact start completely from scratch; it converges from a loss of 2.7 - but I'm not fully sure:
[image: training graph]

And the NAR weights seem to be completely erased in the new checkpoints, judging by their size.

@chenjiasheng (Collaborator) commented May 21, 2023

@RuntimeRacer
Hey, bro, please just train AR and NAR independently, using stage=1 and stage=2 respectively.
No need to rely on the tricky checkpoint transfer mechanism between AR and NAR.
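
As a rough sketch of what training the two stages independently could look like (the flags are copied from the trainer command @RuntimeRacer shared above; the separate experiment directories and the epoch counts are placeholders, not tested settings):

# Shared flags, copied from the trainer invocation quoted earlier in this thread.
COMMON_FLAGS="--max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 \
  --num-buckets 6 --dtype bfloat16 --save-every-n 2500 --valid-interval 2500 \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 --accumulate-grad-steps 4 \
  --world-size 6 --keep-last-k 40 \
  --manifest-dir /workspace/kajispeech-v2/commonvoice/data/tokenized \
  --text-tokens /workspace/kajispeech-v2/commonvoice/data/tokenized/unique_text_tokens.k2symbols"

# Stage 1: AR model in its own experiment directory.
python3 bin/trainer.py $COMMON_FLAGS --train-stage 1 --num-epochs 20 --exp-dir exp/valle_ar

# Stage 2: NAR model, trained independently in a separate experiment directory.
python3 bin/trainer.py $COMMON_FLAGS --train-stage 2 --num-epochs 20 --exp-dir exp/valle_nar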

@lifeiteng
How about we disable train_stage 0 and add both the AR and NAR checkpoints to the args of infer.py, instead of a single merged checkpoint?
I see many people confused here, including myself two weeks ago.
Maybe @RuntimeRacer could make a PR? If not, I will.

@lifeiteng (Owner)

@RuntimeRacer @chenjiasheng Yes, we can do better!

There exists a trick in the current implementation:
First, --train-stage 1 -> best-valid.pt.
Then cp best-valid.pt to epoch-2.pt and train with --start-epoch 3 (= 2 + 1), which reloads the weights trained with --train-stage 1 and then starts training the NAR weights.
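
Spelled out as shell steps, the trick could look roughly like this (a sketch: the exp/valle paths and the abbreviated flag set are taken from the commands shared above, the --train-stage 2 flag is implied by "starts training the NAR weights", and anything omitted falls back to trainer.py defaults):

# After the stage-1 (AR) run has produced exp/valle/best-valid.pt,
# expose it as an epoch checkpoint so the next run reloads the AR weights.
cp exp/valle/best-valid.pt exp/valle/epoch-2.pt

# Resume in the same exp dir with --start-epoch 3 (= 2 + 1): trainer.py loads
# epoch-2.pt (the stage-1 AR weights) and then trains the NAR weights.
python3 bin/trainer.py --train-stage 2 --start-epoch 3 --exp-dir exp/valle \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 \
  --manifest-dir /workspace/kajispeech-v2/commonvoice/data/tokenized \
  --text-tokens /workspace/kajispeech-v2/commonvoice/data/tokenized/unique_text_tokens.k2symbols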

A PR is welcome! I don't have time to work on it right now.

@RuntimeRacer (Contributor, Author)

Small update on AR model training progress:
[image: training graph]
It's slow, but it continues to converge. I'll keep it running until the valid loss eventually stops improving.
